Running environment¶
Package installation¶
# Install packages (datetime is part of the Python standard library and needs no install)
%pip install pandas
%pip install scipy
%pip install plotly
%pip install scikit-learn
# Enable Plotly in Jupyter Notebook
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)
1. Logistics and Product Development in the Automobile Industry¶
# Import packages
import pandas as pd
import datetime
# Read data
produktionsdatum = pd.read_csv("Data/Logistikverzug/Komponente_K7.csv", delimiter=';')
wareneingang = pd.read_csv("Data/Logistikverzug/Logistikverzug_K7.csv")
# Drop rows that contain missing values, if any exist
if produktionsdatum.isna().any().any():
    produktionsdatum.dropna(inplace=True)
if wareneingang.isna().any().any():
    wareneingang.dropna(inplace=True)
# Transform data
produktionsdatum["Issued"] = pd.to_datetime(produktionsdatum["Produktionsdatum"]) + datetime.timedelta(days=1)
wareneingang["Logistics"] = pd.to_datetime(wareneingang["Wareneingang"])
# Merge into dataframe
# Keep 'produktionsdatum' columns as-is and add '_x' suffix to 'wareneingang' columns
combined = pd.merge(produktionsdatum, wareneingang, on='IDNummer', how='outer', suffixes=["", "_x"])
# Calculate the logistics delay
combined['Logistics Delay'] = (combined['Logistics'] - combined['Issued']).dt.days
# Create 'final' as a copy to avoid modifying 'combined'
# (this also prevents a SettingWithCopyWarning)
final = combined[['IDNummer', 'Issued', 'Logistics', 'Logistics Delay', 'Fehlerhaft']].copy()
final.set_index('IDNummer', inplace=True)
print(final.head())
                      Issued  Logistics  Logistics Delay  Fehlerhaft
IDNummer
K7-113-1132-1     2012-11-13 2012-11-18                5           0
K7-113-1132-10    2012-11-13 2012-11-19                6           0
K7-113-1132-100   2012-11-14 2012-11-18                4           0
K7-113-1132-1000  2012-11-23 2012-11-30                7           0
K7-113-1132-10000 2013-02-16 2013-02-22                6           0
(a) To determine the distribution of the logistics delay, we started with descriptive statistics and then applied the Kolmogorov-Smirnov (KS) test against a normal reference distribution. Strictly speaking, the KS test assumes a continuous reference distribution with fully specified parameters; since our delays are discrete and we estimate the mean and standard deviation from the sample, the p-value is only approximate. The test nevertheless rejects normality decisively: the distribution is not normal.
# Import packages
from scipy import stats
# Descriptive statistics for the logistics delay
desc_stats = final['Logistics Delay'].describe()
# print(desc_stats)
# Kolmogorov-Smirnov test
stat, p_value = stats.kstest(final['Logistics Delay'], 'norm',
                             args=(final['Logistics Delay'].mean(),
                                   final['Logistics Delay'].std()))
if p_value < 0.05:
    print("It is not a normal distribution.")
else:
    print("It is a normal distribution.")
It is not a normal distribution.
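Because the KS p-value is only approximate when the mean and standard deviation are estimated from the sample, a complementary check such as SciPy's D'Agostino-Pearson test (`scipy.stats.normaltest`) can be run. A minimal sketch on synthetic skewed integer data (the Poisson draw is illustrative, not the real delays):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Integer-valued "delays" drawn from a skewed distribution, mimicking
# discrete logistics delays (synthetic data for illustration only)
delays = rng.poisson(lam=4.3, size=10_000)

# D'Agostino-Pearson test: checks skewness and kurtosis jointly,
# and does not require pre-specifying the mean and variance
stat, p_value = stats.normaltest(delays)
print(f"statistic={stat:.3f}, p-value={p_value:.4g}")
if p_value < 0.05:
    print("Reject normality at the 5% level.")
else:
    print("No evidence against normality at the 5% level.")
```

This test only needs a moderately large sample and avoids the estimated-parameter caveat of the plain KS test.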
(b) The mean logistics delay is 4.34 business days with a standard deviation of about 0.89 days, so most delays fall within roughly one day of the mean. This indicates a stable and predictable delay pattern.
The following code provides the descriptive statistics of the logistics business day delay. Based on this information, an alternative is to review the processes and address bottlenecks to improve operational speed.
import numpy as np
# Calculate the logistics business day delay
final['Logistics Business Days Delay'] = final.apply(
    lambda row: np.busday_count(row['Issued'].date(), row['Logistics'].date()), axis=1)
print(final['Logistics Business Days Delay'].describe())
count    306490.000000
mean          4.343541
std           0.887599
min           1.000000
25%           4.000000
50%           4.000000
75%           5.000000
max          10.000000
Name: Logistics Business Days Delay, dtype: float64
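The row-wise `apply` above can be slow on 306,490 rows; `np.busday_count` also accepts whole date arrays, which avoids the Python-level loop. A sketch on two hypothetical date pairs:

```python
import numpy as np
import pandas as pd

# Hypothetical issue/receipt dates standing in for 'Issued' and 'Logistics'
df = pd.DataFrame({
    "Issued":    pd.to_datetime(["2012-11-13", "2012-11-23"]),
    "Logistics": pd.to_datetime(["2012-11-18", "2012-11-30"]),
})

# np.busday_count works on whole arrays: convert the datetime columns
# to datetime64[D] and pass them directly, no apply needed
delay = np.busday_count(
    df["Issued"].values.astype("datetime64[D]"),
    df["Logistics"].values.astype("datetime64[D]"),
)
print(delay)  # prints [4 5]
```

On the full dataset this vectorized form computes the same business-day delays in a fraction of the time.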
(c) We chose Sturges' rule to determine the number of bins because it is simple to apply. Note that Sturges' rule was designed for small to moderate samples; with 306,490 data points it yields only about 20 bins, which is acceptable here because the delays take only a handful of integer values (1 to 10 business days).
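As a sanity check on the bin count, Sturges' rule can be compared with the Freedman-Diaconis rule, which scales with sample size and the interquartile range. A sketch on synthetic integer delays of the same size (the Poisson draw is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic integer delays, similar in size and spread to the real column
delays = rng.poisson(lam=4.3, size=306_490)

n = len(delays)
# Sturges: grows only logarithmically with the sample size
sturges = int(np.ceil(np.log2(n) + 1))

# Freedman-Diaconis: bin width from the IQR, robust to outliers
iqr = np.subtract(*np.percentile(delays, [75, 25]))
width = 2 * iqr / n ** (1 / 3)
fd = int(np.ceil((delays.max() - delays.min()) / width))

print(f"Sturges: {sturges} bins, Freedman-Diaconis: {fd} bins")
```

For integer-valued data with a narrow range, the much larger Freedman-Diaconis count adds little, so Sturges is a reasonable choice here.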
import pandas as pd
import plotly.graph_objects as go
# Sturges' Rule: methods for determining the number of bins
num_bins = int(np.ceil(np.log2(len(final['Logistics Business Days Delay'])) + 1))
# Create a Figure object
fig = go.Figure()
# Add histogram trace (normalized to probability)
fig.add_trace(go.Histogram(
    x=final['Logistics Business Days Delay'],
    name='Probability',
    nbinsx=num_bins,
    histnorm='probability',  # normalize histogram to probability
    marker_color='blue',
    opacity=0.75
))
# Add density trace
fig.add_trace(go.Histogram(
    x=final['Logistics Business Days Delay'],
    name='Density',
    nbinsx=num_bins,
    histnorm='density',  # normalize histogram to density
    marker_color='red',
    opacity=0.5
))
# Update layout
fig.update_layout(
    title='Distribution of Logistics Business Days Delay',
    xaxis_title='Logistics Business Days Delay',
    yaxis_title='Density / Probability',
    barmode='overlay'
)
# Show the figure
fig.show()
(d) The decision tree splits on the Year feature but classifies every component as non-defective. With only 5 defective parts among 91,947 test samples, the classes are so imbalanced that always predicting "non-defective" already achieves near-perfect accuracy, so the tree has no incentive to isolate the defects. The visualizations confirm that defects occurred between 2011 and 2014, but non-defective items dominate the same period, so Year alone cannot separate the two classes.
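One common way to address this imbalance, sketched here on synthetic data (the counts and years are made up), is to reweight the classes so the tree is no longer rewarded for always predicting the majority class:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, heavily imbalanced data: 5,000 parts, 10 defects, with the
# defects concentrated in one production year (illustrative values only)
years = np.repeat([2010, 2011, 2012, 2013, 2014], 1000)
y = np.zeros(5000, dtype=int)
y[2000:2010] = 1  # 10 defective parts, all produced in 2012
X = years.reshape(-1, 1)

# Plain tree: the 2012 leaf holds 990 good parts vs 10 defects,
# so the unweighted majority vote predicts 0 everywhere
plain = DecisionTreeClassifier(random_state=0).fit(X, y)

# class_weight='balanced' reweights classes inversely to their frequency,
# so the 10 defects in 2012 outweigh the 990 good parts in that leaf
balanced = DecisionTreeClassifier(class_weight="balanced",
                                  random_state=0).fit(X, y)

print(np.unique(plain.predict(X)))     # → [0]
print(np.unique(balanced.predict(X)))  # → [0 1]
```

Resampling (e.g. oversampling the minority class) is an alternative; either way, precision/recall for the defective class, not accuracy, is the metric to watch.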
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
# Read data again
data = pd.read_csv("Data/Logistikverzug/Komponente_K7.csv", delimiter=';')
data = data.drop('IDNummer', axis=1)
data = data.drop('Unnamed: 0', axis=1)
# Convert the 'Date' column to datetime
data['Produktionsdatum'] = pd.to_datetime(data['Produktionsdatum'])
# Extract the year
data['Year'] = data['Produktionsdatum'].dt.year
data = data.drop('Produktionsdatum', axis=1)
# Preprocess the data
# Assume 'Defective' is the target variable and the rest are features
X = data.drop('Fehlerhaft', axis=1)
y = data['Fehlerhaft']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the decision tree model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Evaluate the model
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# Visualize the decision tree
tree_rules = export_text(model, feature_names=list(X.columns))
print(tree_rules)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     91942
           1       0.00      0.00      0.00         5

    accuracy                           1.00     91947
   macro avg       0.50      0.50      0.50     91947
weighted avg       1.00      1.00      1.00     91947
|--- Year <= 2011.50
| |--- Year <= 2010.50
| | |--- class: 0
| |--- Year > 2010.50
| | |--- class: 0
|--- Year > 2011.50
| |--- class: 0
import pandas as pd
import plotly.graph_objects as go
df = pd.DataFrame(data)
# Check the defective component
# print(data[data['Fehlerhaft'] == 1])
# Create 3D scatter plot
fig = go.Figure()
# Add trace for Fehlerhaft = 0
fig.add_trace(go.Scatter3d(
    x=df[df['Fehlerhaft'] == 0]['Herstellernummer'],
    y=df[df['Fehlerhaft'] == 0]['Werksnummer'],
    z=df[df['Fehlerhaft'] == 0]['Year'],
    mode='markers',
    marker=dict(size=5, color='blue', opacity=0.5),
    name='Fehlerhaft = 0'
))
# Add trace for Fehlerhaft = 1
fig.add_trace(go.Scatter3d(
    x=df[df['Fehlerhaft'] == 1]['Herstellernummer'],
    y=df[df['Fehlerhaft'] == 1]['Werksnummer'],
    z=df[df['Fehlerhaft'] == 1]['Year'],
    mode='markers',
    marker=dict(size=3, color='red', opacity=0.8),
    name='Fehlerhaft = 1'
))
# Update layout
fig.update_layout(
    title='3D Scatter Plot of Fehlerhaft Status',
    scene=dict(
        xaxis_title='Herstellernummer',
        yaxis_title='Werksnummer',
        zaxis_title='Year'
    )
)
# Show the figure
fig.show()
2. Data Storage in Separate Files¶
Storing data in separate files improves organization, efficiency, data integrity, and scalability compared with a single large table: each file holds one entity type and is linked to the others through shared ID columns. This setup mirrors a relational database structure, where normalization avoids storing redundant copies of the same information.
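As a minimal sketch of this idea (with hypothetical tables and IDs, not the real files), two small pandas DataFrames linked by a shared key can be joined on demand:

```python
import pandas as pd

# Two small hypothetical tables linked by a shared key, as in a
# relational schema: one file per entity, joined only when needed
komponenten = pd.DataFrame({
    "IDNummer": ["K7-1", "K7-2"],
    "Produktionsdatum": ["2012-11-13", "2012-11-23"],
})
zulassungen = pd.DataFrame({
    "IDNummer": ["K7-1", "K7-2"],
    "Gemeinden": ["ADELSHOFEN", "ASCHERSLEBEN"],
})

# The join reconstructs the wide view on demand; each file stays small
# and can be updated independently of the others
wide = komponenten.merge(zulassungen, on="IDNummer", how="inner")
print(wide)
```

Updating a registration then touches only one file, while the merged view is always reproducible from the keys.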
3. Parts T16 in Registered Vehicles¶
No part T16 ended up in vehicles registered in Adelshofen.
import pandas as pd
# Read data
df = pd.read_csv("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv", delimiter=";")
# Filter for the municipality "Adelshofen"
df_adel = df[df['Gemeinden'] == 'ADELSHOFEN']
# Count parts T16 (IDs starting with '16') in vehicles registered in Adelshofen
count = df_adel[df_adel['IDNummer'].str.startswith('16')]
print(len(count))
0
4. Attributes of the Registration Table¶
Characteristics of the data types:
- int64: Efficient for numerical operations and comparisons where whole numbers are required.
- object: Flexible for textual data but less efficient for numerical operations or date/time calculations.
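To illustrate why the object dtype is limiting for dates, a small sketch (with hypothetical IDs and dates) converts an object column to datetime64, which enables date arithmetic and comparisons:

```python
import pandas as pd

# A hypothetical registration-like frame where everything is read in as object
df = pd.DataFrame({
    "IDNummer": ["K7-1", "K7-2"],
    "Zulassung": ["2012-11-18", "2012-11-30"],
})
print(df.dtypes["Zulassung"])  # object: no date arithmetic possible yet

# Converting to datetime64 unlocks the .dt accessor and date comparisons
df["Zulassung"] = pd.to_datetime(df["Zulassung"])
print(df["Zulassung"].dt.year.tolist())  # → [2012, 2012]
```

The same conversion would apply to the Zulassung column of the real registration table before any date-based filtering.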
# Package Installations
%pip install tabulate
import pandas as pd
# Identify the data types of the attributes
df_dtype = df.dtypes
print(df_dtype.to_markdown(headers=[]))
|:-----------|:-------|
| Unnamed: 0 | int64  |
| IDNummer   | object |
| Gemeinden  | object |
| Zulassung  | object |
5. Linear Model for Mileage¶
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
# Read data
df = pd.read_csv('Data/Fahrzeug/Fahrzeuge_OEM1_Typ11_Fehleranalyse.csv')
# Encode categorical variables
label_encoder = LabelEncoder()
df['engine_encoded'] = label_encoder.fit_transform(df['engine'])
# Try all variables
X = df[['Herstellernummer', 'Werksnummer', 'days', 'fuel', 'engine_encoded']]
y = df['Fehlerhaft_Fahrleistung']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate the model
y_pred = model.predict(X_test)
# Display the coefficients
coefficients = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coefficients)
                  Coefficient
Herstellernummer     0.000000
Werksnummer         50.750496
days                -0.075439
fuel              5230.074860
engine_encoded    4616.606072
- Werksnummer: 50.750496
- Each unit increase in Werksnummer is associated with about 50.75 additional units of mileage (Werksnummer is a plant identifier, so this association should not be read causally).
- fuel: 5230.074860
- Each unit increase in fuel consumption is associated with approximately 5230.07 more units of mileage.
- engine_encoded: 4616.606072
- The higher-coded engine category is associated with about 4616.61 more units of mileage; since LabelEncoder assigns integer codes alphabetically, which category is "higher" depends on the encoding.
In summary, these three factors—Werksnummer, fuel consumption, and engine size—affect mileage.
# Select variables
X = df[['Werksnummer', 'fuel', 'engine_encoded']]
y = df['Fehlerhaft_Fahrleistung']
# Re-split so the models are trained on the selected features
# (otherwise X_train still holds the earlier five-feature split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model for LinearRegression
model1 = LinearRegression()
model1.fit(X_train, y_train)
# Evaluate the model
y_pred = model1.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error (LinearRegression): {mse}')
print(f'R-squared (LinearRegression): {r2}')
# Train the model for Lasso
model2 = Lasso()
model2.fit(X_train, y_train)
# Evaluate the model
y_pred = model2.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error (Lasso): {mse}')
print(f'R-squared (Lasso): {r2}')
# Train the model for Ridge
model3 = Ridge()
model3.fit(X_train, y_train)
# Evaluate the model
y_pred = model3.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error (Ridge): {mse}')
print(f'R-squared (Ridge): {r2}')
Mean Squared Error (LinearRegression): 123894112.7151807
R-squared (LinearRegression): 0.4884502316291339
Mean Squared Error (Lasso): 123895518.13207746
R-squared (Lasso): 0.4884444287651216
Mean Squared Error (Ridge): 123894147.79356399
R-squared (Ridge): 0.4884500867930439
The mean squared error (MSE) is nearly identical for the Linear Regression, Lasso, and Ridge models, and its square root implies an average prediction error of roughly 11,000 mileage units. An R-squared of about 0.488 means the models explain roughly 48.8% of the variance in mileage, leaving 51.2% unexplained. This indicates moderate explanatory power with clear room for improvement, and all three models perform almost identically.
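A single train/test split can be noisy; k-fold cross-validation gives a more robust comparison of the three models. A sketch on synthetic regression data (the coefficients and noise level are illustrative, not the real mileage data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic stand-ins for the three selected features and the target
X = rng.normal(size=(1000, 3))
y = X @ np.array([50.0, 5000.0, 4500.0]) + rng.normal(0, 3000.0, 1000)

# 5-fold cross-validated R² averages over several splits instead of one
for name, model in [("Linear", LinearRegression()),
                    ("Lasso", Lasso()), ("Ridge", Ridge())]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R² = {scores.mean():.3f} ± {scores.std():.3f}")
```

The spread across folds also indicates how sensitive each model is to the particular split, which a single test set cannot show.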
Based on the model, we recommend that OEM1 focus on fuel efficiency, which shows the strongest association with mileage. We also recommend examining larger engines, which are associated with notably higher mileage. Lastly, we recommend further data collection and model refinement to improve predictive accuracy and address the unexplained variance.
6. Hit and Run Accident Investigation¶
The code reads the vehicle registration data and concatenates all CSV files whose names start with "Bestandteile". It then looks up the registration municipality of the vehicle containing the given body part number and prints it.
import os
import pandas as pd
# Load the dataset, columns = ['Unnamed: 0', 'IDNummer', 'Gemeinden', 'Zulassung']
zulassung = pd.read_csv('Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv', delimiter=";")
# List to store DataFrames
dfs = []
# Iterate over all files in the directory
for filename in os.listdir('Data/Fahrzeug'):
    # Keep files that start with "Bestandteile" and end with ".csv"
    if filename.startswith('Bestandteile') and filename.endswith('.csv'):
        file_path = os.path.join('Data/Fahrzeug', filename)
        df = pd.read_csv(file_path, delimiter=";")
        dfs.append(df)
# Concatenate DataFrames
combined = pd.concat(dfs, ignore_index=True)
# Filter combined based on ID_Karosserie
IDNummer_hit_and_run = combined[combined['ID_Karosserie']=="K5-112-1122-79"]['ID_Fahrzeug']
# Filter zulassung based on IDNummer_hit_and_run
zulassungsort = zulassung[zulassung['IDNummer'].isin(IDNummer_hit_and_run)]['Gemeinden']
# Print out the answer
print(zulassungsort.values[0])
ASCHERSLEBEN
Based on the analysis, the vehicle with the body part number “K5-112-1122-79” was registered in ASCHERSLEBEN.